RAPToR - ShowcaseIn this vignette, you will find a showcase of a few useful ways RAPToR can be used.
The dataset used in this example corresponds to a C. elegans profiling experiment by Lehrbach et al. (2012), hereafter called dslehrbach2012 (Accession : E-MTAB-1333). Profiling was done using microarray (Affymetrix).
The experimental setup is the following. Two C. elegans strains : wild-type (wt) and pash-1(mj100) (mut) were profiled at 0, 6, 12 and 24 hours past the L4 stage (~ 50 hours post-hatching). These were done in triplicates.
In this scenario, performing a Differential Gene Expression (DGE) analysis to uncover the effect of the strain relies on the inclusion of the effect of development. This is true in any time series context, even more so here because the late larval development of C. elegans is known to lead to drastic changes in gene expression within a short time frame.
Code to generate dslehrbach2012 :
Note : set the data_folder variable to an existing path on your system where you want to store the objects.
data_folder <- "../inst/extdata/"
library("biomaRt") # May need to be installed with bioconductor
library("limma") # ..
library("affy") # ..
library("gcrma") # ..
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
dslehrbach202# get probe ids from the arrayexpress
probe_ids <- read.table("https://www.ebi.ac.uk/arrayexpress/files/A-AFFY-60/A-AFFY-60.adf.txt", h=T,
comment.char = "", sep = "\t", skip = 17, as.is = T)
# pheno data
p_url <- "https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-1333/E-MTAB-1333.sdrf.txt"
p <- read.table(p_url, h = T, sep = "\t", comment.char = "", as.is = T)
# geno data
g_url <- unique(p$Comment..ArrayExpress.FTP.file.)
g_dir <- paste0(data_folder, "dslehrbach2012_raw/")
g_file <- paste0(g_dir, "raw.zip")
if(!file.exists(g_file)){
dir.create(g_dir)
utils::download.file(g_url, destfile = g_file)
}
utils::unzip(g_file, exdir = g_dir)
g <- affy::ReadAffy(filenames = p$Array.Data.File, celfile.path = g_dir, phenoData = p)
g <- affy::expresso(g, bg.correct = F, normalize = F,
pmcorrect.method = "pmonly", summary.method = "median")
g <- 2^exprs(g) # expresso log2s the data
colnames(g) <- p$Source.Name
# format ids
g <- RAPToR::format_ids(g, probe_ids, from = 1, to = 7)
g <- RAPToR::format_ids(g, wormRef::Cel_genes, from = 3, to = 1)
# filter relevant fields
p <- p[, c(1,22,24)]
colnames(p) <- c("title", "tpastL4", "strain")
p$strain <- factor(p$strain, levels = c('wild type', 'pash-1(mj100)'), labels = c('wt', 'strain'))
p$rep <- factor(gsub("\\w+_h\\d+\\.(\\d)","\\1", p$title))
dslehrbach2012 <- list(g = g, p = p)
save(dslehrbach2012, file = paste0(data_folder, "dslerhbach2012.RData"), compress = "xz")
# cleanup
file.remove(g_file)
unlink(g_dir, recursive = T)
rm(g, g_url, g_dir, g_file, p, p_url, probe_ids)
dslehrbach2012$g <- limma::normalizeBetweenArrays(dslehrbach2012$g, method = "quantile")
dslehrbach2012$g <- log(dslehrbach2012$g + 1)
The samples’ (chronological) ages range from after L4 to 24h past L4. The Cel_YA_2 reference from wormRef covers this range, so we’ll select it to stage the samples.
r_ya <- prepare_refdata("Cel_YA_2", datapkg = "wormRef", n.inter = 200)
ae_dslb <- ae(dslehrbach2012$g, r_ya$interpGE, r_ya$time.series)
We can see quite a bit of variation in the sample timings, especially at the 0 and 6 hour timepoints. Another interesting effect can be seen if we group the samples by replicate.
Notice how, in the first two replicates, the mut samples are consistantly older than the wt ones, which is the opposite way around for the 3rd replicate. We can maybe speculate about a potential effect of their sampling order.
cc <- cor(dslehrbach2012$g, dslehrbach2012$g, method = "spearman")
Let’s look at the correlation between the samples and compare it to their estimated age difference. In the graphic below, we plot the correlation score between all sample pairs against their estimated age difference. Each point corresponds to a pair of samples and the color-coding corresponds to whether both samples are of the same strain or not.
Notice how the strain does not really impact correlation between samples, compared to developmental differences. We can also observe this through correlation heatmaps ordered by strain & chronological age or estimated age.
Let’s perform a simple DGE analysis using limma, with the following model :
\[X \sim strain \times \texttt{ns(age, df = 2)} \]
We’re using a spline with \(\texttt{ns()}\) to tackle non-linear dynamics in the data.
We’ll look at the number of genes differentially expressed due to development or strain, which translates to comparing the following nested models (2 vs. 1 for development, 3 vs. 1 for strain).
\[ \begin{align} Y & = \beta_0 + \beta_1 I_{strain} + (\alpha_1 age_{sp1} + \alpha_2 age_{sp2}) + (\gamma_1 I_{strain} age_{sp1} + \gamma_2 I_{strain} age_{sp2})\\ Y & = \beta_0 + \beta_1 I_{strain}\\ Y & = \beta_0 + (\alpha_1 age_{sp1} + \alpha_2 age_{sp2}) \end{align} \]
Note that we’re not interested in whether genes are up or down-regulated, but only in the detection of an effect. Genes will be considered differentially expressed with Bonferoni-Holm adjusted p.values of coefficients below 0.05.
DGE code :
# Predictions from limma model
pred_lmFit <- function(Fit){
tcrossprod(Fit$coefficients, Fit$design)
}
# Compute Goodness of Fit
GoF <- function(Fit, X){
pred <- pred_lmFit(Fit)
res <- (X - pred)
ss <- apply(X, 1, function(ro) sum((ro - mean(ro))^2))
Rsq <- sapply(seq_len(nrow(X)), function(i){
1 - sum(res[i,]^2)/ss[i]
})
return(Rsq)
}
DGE <- function(X, strain, age, df = 2, name = NULL, return.model = FALSE){
require(splines)
if(! length(strain) == ncol(X) | ! length(age) == ncol(X))
stop("strain and age must be of length ncol(X).")
# make pdat df
pdat <- data.frame(strain = factor(strain), age = as.numeric(age), row.names = colnames(X))
# build design matrix
d <- model.matrix(~ 1 + ns(age, df = df) * strain, data = pdat)
# fix colnames
colnames(d) <- c("b0", paste0(rep("a", df), 1L:df), "strainmut", paste0(rep("strainmut.a", df), 1L:df))
# build contrast matrices for mut and age tests
if(df == 2){
cm.mut <- makeContrasts(mut = strainmut, #strainwt - strainmut,
mut.i1 = strainmut.a1,
mut.i2 = strainmut.a2,
levels = d)
cm.age <- makeContrasts(a1, a2,
a.i1 = strainmut.a1,
a.i2 = strainmut.a2,
levels = d)
}
if(df == 3){
cm.mut <- makeContrasts(mut = strainmut, #strainwt - strainmut,
mut.i1 = strainmut.a1,
mut.i2 = strainmut.a2,
mut.i3 = strainmut.a3,
levels = d)
cm.age <- makeContrasts(a1, a2, a3,
a.i1 = strainmut.a1,
a.i2 = strainmut.a2,
a.i3 = strainmut.a3,
levels = d)
}
# fit model
m.0 <- lmFit(object = X, design = d)
# get GoF
gof <- GoF(Fit = m.0, X = X)
# find DE genes for mut
m.m <- contrasts.fit(m.0, contrasts = cm.mut)
m.m <- eBayes(m.m)
# find DE genes for age
m.a <- contrasts.fit(m.0, contrasts = cm.age)
m.a <- eBayes(m.a)
Tmut <- topTable(m.m, adjust.method = "BH", number = Inf, sort.by = "none")[, c("F", "P.Value", "adj.P.Val")]
Tage <- topTable(m.a, adjust.method = "BH", number = Inf, sort.by = "none")[, c("F", "P.Value", "adj.P.Val")]
res <- list(gof = gof,
tmut = Tmut,
tage = Tage,
name = name)
if(return.model)
res$model = m.0
rm(m.0, m.m, m.a, Tmut, Tage, d, X, pdat, gof)
gc(verbose = F)
return(res)
}
We want to perform the same DGE analysis using the chronological age, and our estimates, to see if there is an improvement in detecting strain and age effect.
# format gdata
X <- log2(exp(dslehrbach2012$g))
dge.ca <- DGE(X = X, strain = dslehrbach2012$p$strain, age = dslehrbach2012$p$tpastL4,
name = "dge.ca", return.model = T)
dge.ae <- DGE(X = X, strain = dslehrbach2012$p$strain, age = ae_dslb$age.estimates[,1],
name = "dge.ae", return.model = T)
In the plots below are shown the adjusted p.values for detection of an effect for each gene, when using chronological or estimate age in the model. The red bars correspond to the 0.05 threshold for significance.
mut_dif.ca <- sum(dge.ca$tmut$adj.P.Val < 0.05)
mut_dif.ae <- sum(dge.ae$tmut$adj.P.Val < 0.05)
age_dif.ca <- sum(dge.ca$tage$adj.P.Val < 0.05)
age_dif.ae <- sum(dge.ae$tage$adj.P.Val < 0.05)
100 * (mut_dif.ae - mut_dif.ca) / (mut_dif.ca) # mut pct increase
#> [1] 91.8583
100 * (age_dif.ae - age_dif.ca) / (age_dif.ca) # age pct increase
#> [1] 9.081006
We detect nearly twice as many DE genes for strain using our estimates compared to using the chronological age, and around \(10 \%\) more genes with development. This shows how crucial it can be to take developmental dynamics into account correctly in DGE analyses.
We can also see this increase in the model’s fit itself, by looking at the Goodness of Fit distribution accross genes. The GoF computed is an \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\) per gene.
If the asynchronicity observed with age estimates is erroneous, then using random noise around the chronological values should yield similar results to using our estimates in the model. To test this, we can simulate age sets with random noise of similar distribution to the age differences observed between chronological and estimated age.
To compare the results, we use the model’s Goodness of Fit (GoF) per gene. As done above, the GoF computed is an \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\).
We also look at the number of differentially expressed genes found for strain and development (BH-adjusted p.value below 0.05).
age_diffs <- (dslehrbach2012$p$tpastL4 + 50) - ae_dslb$age.estimates[,1]
# Note : + 50 to shift tpastL4 to age, avoiding generation of negative tpastl4 values,
# this has no impact on the DGE analysis (it just shifts the age window).
# estimate density function of age_diffs
d_ad <- density(age_diffs)
# generate age sets of with random age_diffs-like noise
set.seed(10)
n <- 100
rd_ages <- lapply(seq_len(n), function(i){
(dslehrbach2012$p$tpastL4 + 50) + sample(x = d_ad$x, size = nrow(dslehrbach2012$p), prob = d_ad$y, replace = T)
})
# setup cluster for parallelization
cl <- makeCluster(3, "FORK")
# do DGE on all age sets
rd_dges <- parLapply(cl, seq_len(n), function(i){
cat("\r", i,"/",n)
DGE(X, dslehrbach2012$p$strain, rd_ages[[i]], name = paste0("rd.",i))
})
stopCluster(cl)
gc()
In the plots above, we can see that using the estimated age leads to better model fits and an increased number of differentially expressed genes.
We can also look at the GoF distributions ordered by chronological age quantiles. The gene GoF are binned along the quantile scale, and shown as boxplots instead of scatterplot for a clearer view.
The Chronological age GoF QQ line is the same as for the previous plot (since the GoF values are ordered according to it).
We can see that using our estimates overall outperforms the random noise sets and the chronological age.
Sometimes, profiled samples may not be from a well-studied organism, with an abundance of reference time series available. However, you can still refer to the closest model organisms as a reference. Though results will not be on par as estimates using a reference from the same species, you can stil get good results thanks to the conserved nature of developmental processes across species.
Let’s stage samples cross-species using orthologs genes.
We’ll be working with 3 datasets here.
dslevin2016dmel and dslevin2016cel respectively. (Accessions : GSE60471 and GSE60755)dsgraveley2011. (Data downloaded from fruitfly.org)Furthermore, we’ll use a set of orthologs between D. melanogaster and C. elegans, from the supplementary data of Li et al. (2014). This list will be stored in the glist object.
Since it can get a little confusing between the plots below, the D.melanogaster samples will always be in red and the C. elegans samples will always be in blue
First, we know the drosophila time series (dslevin2016dmel) has imprecise chronological ages, so we’ll build a reference with the dsgraveley2011 data (as in the second example of the reference-building vignette) and stage the drosophila series on it to get accurate developmental timings.
We’ll filter the data to keep only the orthologs between D. melanogaster and C. elegans (and convert FBgn IDs to WBGene IDs).
Then, we’ll build a reference with the dslevin2016cel data (and stage its samples on it).
Finally, stage the Drosophila samples on it.
We’ll filter the data to only the orthologs between D. melanogaster and C. elegans (and convert WBGene IDs to FBgn IDs).
Then we’ll use the reference built with the dsgraveley2011 to stage the dslevin2016cel samples on it.
Code to generate glist, dslevin2016dmel, dslevin2016cel and dsgraveley2011 :
Note : set the data_folder variable to an existing path on your system where you want to store the objects.
data_folder <- "../inst/extdata/"
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # May need to be installed with bioconductor
requireNamespace("Biobase", quietly = T)
raw2rpkm <- function(X, gene.length, id.col = 1, l.col='length'){
# Compute RPKM from raw counts
if(!all(rownames(X)%in%gene.length[, id.col])){
stop("Some genes are missing length info !")
}
res <- sapply(colnames(X), function(samp){
pm <- sum(X[,samp])/1e6
rpkm <- (X[,samp]/pm)/(gene.length[match(rownames(X), gene.length[, id.col]), l.col]/1000)
})
rownames(res) <- rownames(X)
return(res)
}
requireNamespace("biomaRt", quietly = TRUE)
mart <- biomaRt::useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
dmel_genes <- biomaRt::getBM(attributes = c("ensembl_gene_id",
"ensembl_transcript_id",
"external_gene_name",
"transcript_length"),
mart = mart)
colnames(dmel_genes)[1:3] <- c("fb_id", "transcript_id", "gene_name")
rm(mart)
glist
Get list of ortholog genes between C. elegans and D. melanogaster from Li et al. (2014) supplementary table 1.
tmp_file <- paste0(data_folder, "dmel_cel_orth.zip")
tmp_fold <- paste0(data_folder, "dmel_cel_orth/")
f_url <- "https://genome.cshlp.org/content/suppl/2014/05/15/gr.170100.113.DC1/Supplemental_Files.zip"
utils::download.file(url = f_url, destfile = tmp_file)
utils::unzip(tmp_file, exdir = tmp_fold)
glist <- read.table(paste0(tmp_fold, "Supplementary\ files/TableS1\ fly-worm\ ortholog\ pairs.txt"),
skip = 1, h=T, sep = "\t", as.is = T, quote = "\"")
colnames(glist) <- c("fb_id", "dmel_name", "cel_id", "cel_name")
glist$wb_id <- wormRef::Cel_genes[match(glist$cel_id, wormRef::Cel_genes$sequence_name), "wb_id"]
save(glist, file = paste0(data_folder, "sc2_glist.RData"), compress = "xz")
file.remove(tmp_file)
unlink(tmp_fold, recursive = T)
rm(tmp_file, tmp_fold, f_url)
dslevin2016dmel
geo_dslevin2016dmel <- "GSE60471"
g_url_dslevin2016dmel <- GEOquery::getGEOSuppFiles(geo_dslevin2016dmel, makeDirectory = FALSE, fetch_files = FALSE)
g_file_dslevin2016dmel <- paste0(data_folder, "dslevin2016dmel.txt.gz")
utils::download.file(url = as.character(g_url_dslevin2016dmel$url[3]), destfile = g_file_dslevin2016dmel)
X_dslevin2016dmel <- read.table(gzfile(g_file_dslevin2016dmel), h = T, sep = '\t', as.is = T, row.names = 1, comment.char = "")
# filter poor quality samples
cm_dslevin2016dmel <- RAPToR::cor.gene_expr(X_dslevin2016dmel, X_dslevin2016dmel)
f_dslevin2016dmel <- which(0.6 > apply(cm_dslevin2016dmel, 1, quantile, probs = .99))
X_dslevin2016dmel <- X_dslevin2016dmel[, -f_dslevin2016dmel]
# convert to rpkm & FBgn
X_dslevin2016dmel <- RAPToR::format_ids(X_dslevin2016dmel, dmel_genes, from = "fb_id", to = "fb_id")
X_dslevin2016dmel <- raw2rpkm(X = X_dslevin2016dmel, gene.length = dmel_genes, id.col = "fb_id", l.col = "transcript_length")
# pheno data
P_dslevin2016dmel <- Biobase::pData(GEOquery::getGEO(geo_dslevin2016dmel, getGPL = F)[[1]])
# filter relevant fields/samples
P_dslevin2016dmel <- P_dslevin2016dmel[, c("title", "geo_accession", "time (minutes cellularization stage):ch1")]
colnames(P_dslevin2016dmel)[3] <- "time"
P_dslevin2016dmel$title <- as.character(P_dslevin2016dmel$title)
P_dslevin2016dmel <- P_dslevin2016dmel[P_dslevin2016dmel$title %in% colnames(X_dslevin2016dmel),]
X_dslevin2016dmel <- X_dslevin2016dmel[, P_dslevin2016dmel$title]
# formatting
P_dslevin2016dmel$title <- gsub('Metazome_Drosophila_timecourse_', '', P_dslevin2016dmel$title)
colnames(X_dslevin2016dmel) <- P_dslevin2016dmel$title
P_dslevin2016dmel$age <- as.numeric(P_dslevin2016dmel$time) / 60
dslevin2016dmel <- list(g = X_dslevin2016dmel, p = P_dslevin2016dmel)
save(dslevin2016dmel, file = paste0(data_folder, "dslevin2016dmel.RData"), compress = "xz")
# cleanup
file.remove(g_file_dslevin2016dmel)
rm(geo_dslevin2016dmel, g_url_dslevin2016dmel, g_file_dslevin2016dmel,
f_dslevin2016dmel, cm_dslevin2016dmel, X_dslevin2016dmel, P_dslevin2016dmel)
dslevin2016cel
geo_dslevin2016cel <- "GSE60755"
g_url_dslevin2016cel <- GEOquery::getGEOSuppFiles(geo_dslevin2016cel, makeDirectory = FALSE, fetch_files = FALSE)
g_file_dslevin2016cel <- paste0(data_folder, "dslevin2016cel.txt.gz")
utils::download.file(url = as.character(g_url_dslevin2016cel$url[1]), destfile = g_file_dslevin2016cel)
X_dslevin2016cel <- read.table(gzfile(g_file_dslevin2016cel), h = T, sep = '\t', as.is = T, row.names = 1, comment.char = "")
# filter poor quality samples
cm_dslevin2016cel <- RAPToR::cor.gene_expr(X_dslevin2016cel, X_dslevin2016cel)
f_dslevin2016cel <- which(0.67 > apply(cm_dslevin2016cel, 1, quantile, probs = .99))
X_dslevin2016cel <- X_dslevin2016cel[, -f_dslevin2016cel]
# convert to rpkm & FBgn
X_dslevin2016cel <- RAPToR::format_ids(X_dslevin2016cel, wormRef::Cel_genes, from = "sequence_name", to = "wb_id")
X_dslevin2016cel <- raw2rpkm(X = X_dslevin2016cel, gene.length = wormRef::Cel_genes, id.col = "wb_id", l.col = "transcript_length")
# pheno data
P_dslevin2016cel <- Biobase::pData(GEOquery::getGEO(geo_dslevin2016cel, getGPL = F)[[1]])
# filter relevant fields/samples
P_dslevin2016cel <- P_dslevin2016cel[, c("title", "geo_accession", "time point (minutes after 4-cell):ch1")]
colnames(P_dslevin2016cel)[3] <- "time"
P_dslevin2016cel$title <- as.character(P_dslevin2016cel$title)
P_dslevin2016cel <- P_dslevin2016cel[P_dslevin2016cel$title %in% colnames(X_dslevin2016cel),]
# formatting
P_dslevin2016cel$age <- as.numeric(P_dslevin2016cel$time) / 60
P_dslevin2016cel <- P_dslevin2016cel[order(P_dslevin2016cel$age),]
X_dslevin2016cel <- X_dslevin2016cel[, P_dslevin2016cel$title]
P_dslevin2016cel$title <- gsub('Metazome_CE_timecourse_', '', P_dslevin2016cel$title)
colnames(X_dslevin2016cel) <- P_dslevin2016cel$title
X_dslevin2016cel <- X_dslevin2016cel[,-127] # remove extra outlier
P_dslevin2016cel <- P_dslevin2016cel[-127,]
dslevin2016cel <- list(g = X_dslevin2016cel, p = P_dslevin2016cel)
save(dslevin2016cel, file = paste0(data_folder, "dslevin2016cel.RData"), compress = "xz")
# cleanup
file.remove(g_file_dslevin2016cel)
rm(geo_dslevin2016cel, g_url_dslevin2016cel, g_file_dslevin2016cel,
f_dslevin2016cel, cm_dslevin2016cel, X_dslevin2016cel, P_dslevin2016cel)
dsgraveley2011
g_url_dsgraveley2011 <- "ftp://ftp.fruitfly.org/pub/download/modencode_expression_scores/Celniker_Drosophila_Annotation_20120616_1428_allsamps_MEAN_gene_expression.csv.gz"
g_file_dsgraveley2011 <- paste0(data_folder, "dsgraveley2011.csv.gz")
utils::download.file(g_url_dsgraveley2011, destfile = g_file_dsgraveley2011)
X_dsgraveley2011 <- read.table(gzfile(g_file_dsgraveley2011), sep = ',', row.names = 1, h = T)
# convert gene ids to FBgn
X_dsgraveley2011 <- RAPToR::format_ids(X_dsgraveley2011, dmel_genes, from = "gene_name", to = "fb_id")
# select embryo time series samples
X_dsgraveley2011 <- X_dsgraveley2011[,1:12]
P_dsgraveley2011 <- data.frame(sname = colnames(X_dsgraveley2011),
age = as.numeric(gsub("em(\\d+)\\.\\d+hr", "\\1", colnames(X_dsgraveley2011))),
stringsAsFactors = FALSE)
dsgraveley2011 <- list(g = X_dsgraveley2011, p = P_dsgraveley2011)
save(dsgraveley2011, file = paste0(data_folder, "dsgraveley2011.RData"), compress = "xz")
# cleanup
file.remove(g_file_dsgraveley2011)
rm(g_url_dsgraveley2011, g_file_dsgraveley2011, X_dsgraveley2011, P_dsgraveley2011)
dsgraveley2011$g <- limma::normalizeBetweenArrays(dsgraveley2011$g, method = "quantile")
dsgraveley2011$g <- log(dsgraveley2011$g + 1)
dslevin2016dmel$g <- limma::normalizeBetweenArrays(dslevin2016dmel$g, method = "quantile")
dslevin2016dmel$g <- log(dslevin2016dmel$g + 1)
dslevin2016cel$g <- limma::normalizeBetweenArrays(dslevin2016cel$g, method = "quantile")
dslevin2016cel$g <- log(dslevin2016cel$g + 1)
dslevin2016cel$g_cel <- format_ids(dslevin2016cel$g, glist, from = "wb_id", to = "wb_id")
dslevin2016dmel$g_cel <- format_ids(dslevin2016dmel$g, glist, from = "fb_id", to = "wb_id")
We know from previous use of the dslevin2016dmel dataset that its samples have rather imprecise chronological-developmental synchronicity (see the refbuilding vignette’s 2nd example for more information). We’ll first estimate the age of the samples and use that as a basis for the comparison when staging on C. elegans data.
m_grav <- ge_im(X = dsgraveley2011$g, p = dsgraveley2011$p, formula = "X ~ s(age, bs = 'cr')", nc = 4)
n.inter <- 200
newdat <- data.frame(
age = seq(min(dsgraveley2011$p$age), max(dsgraveley2011$p$age), l = n.inter)
)
r_grav <- list(interpGE = predict(m_grav, newdata = newdat), time.series = newdat$age)
ae_dmel <- ae(dslevin2016dmel$g, r_grav$interpGE, r_grav$time.series)
dslevin2016dmel$p$ae <- ae_dmel$age.estimates[,1]
As we have components that are rather noisy with this data, we’ll only use 5 components to build the reference.
pca_cel <- stats::prcomp(dslevin2016cel$g_cel, rank = 10)
nc <- 5
m_cel <- ge_im(X = dslevin2016cel$g_cel, p = dslevin2016cel$p, formula = "X ~ s(age, bs = 'cr')", nc = nc)
n.inter <- 200 # nb of new timepoints
newdat <- data.frame(
age = seq(min(dslevin2016cel$p$age), max(dslevin2016cel$p$age), l = n.inter)
)
pred_cel_comp <- predict(m_cel, newdata = newdat, as.c = T) # for plotting
r_cel <- list(interpGE = predict(m_cel, newdata = newdat), time.series = newdat$age)
Check the reference by staging its samples on it :
ae_cel_on_cel <- ae(dslevin2016cel$g_cel, r_cel$interpGE, r_cel$time.series)
ae_dmel_on_cel <- ae(dslevin2016dmel$g_cel, r_cel$interpGE, r_cel$time.series)
We have already built a D. melanogaster with dsgraveley2011, which we can use directly here.
dslevin2016cel$g_dmel <- format_ids(dslevin2016cel$g, glist, from = "wb_id", to = "fb_id")
ae_cel_on_dmel <- ae(dslevin2016cel$g_dmel, r_grav$interpGE, r_grav$time.series)
It is known that within some species, tissues can develop at different rates, which can vary between individuals. In C. elegans, this developmental heterochrony has been shown between soma and germline (Perez et al. (2017)). By restricting the gene subset on which RAPToR stages the samples to specific tissues, it is possible to get estimates for development specific to these tissues.
We’ll be working with a dataset published by Rockman, Skrovanek, and Kruglyak (2010).
This dataset correpsonds to microarray profiling of 208 recombinant inbred lines of C. elegans N2 and Hawaii (CB4856) strains. These 208 samples were described as “developmentally synchronized” in the original paper. However, it was later demonstrated that a very significant developmental spread of the samples existed, spanning around 20 hours of \(20^\circ C\) late larval development existed (Francesconi and Lehner (2014)).
This essentially makes this dataset a very high-resolution timecourse of late-larval development.
We want to observe the C. elegans heterochrony of soma and germline development.
Using RAPToR, we can capture tissue-specific age by restricting the geneset used for staging to the tissue of interest.
We define a germline geneset of 2554 genes from joining the germline_intrinsic, germline_oogenesis_enriched and germline_sperm_enriched categories defined in Perez et al. (2017).
We define a soma geneset of 2718 genes from the osc gene category defined in Hendriks et al. (2014).
A first staging is done with all available genes : this corresponds to the “Global age”.
Then, tissue-specific staging is simply done by limiting the genes to the previously mentioned sets. The estimates from the soma and germline gene subsets are called “Soma age” and “Germline age” respectively.
We can look at PCA or ICA components against the different ages to evaluate the results. We can e.g. expect components corresponding to molting (oscillatory) processes to be “cleaner” with the soma estimates, consequently introducing noise on the germline components and vice-versa.
Code to generate dsrockman2010, francesconi_time and gsubset.
Note : set the data_folder variable to an existing path on your system where you want to store the objects.
data_folder <- "../inst/extdata/"
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # May need to be installed with bioconductor
requireNamespace("Biobase", quietly = T)
requireNamespace("limma", quietly = T)
geo_dsrockman2010 <- "GSE23857"
geo_dsrockman2010 <- GEOquery::getGEO(geo_dsrockman2010, GSEMatrix = F)
# get pdata
P_dsrockman2010 <- do.call(rbind, lapply(GEOquery::GSMList(geo_dsrockman2010), function(go){
unlist(GEOquery::Meta(go)[
c("geo_accession", "characteristics_ch1", "characteristics_ch2",
"label_ch1", "label_ch2")]
)
}))
P_dsrockman2010 <- as.data.frame(P_dsrockman2010, stringsAsFactors = F)
# get RG data from microarray
RG_dsrockman2010 <- list(
R = do.call(cbind, lapply(GEOquery::GSMList(geo_dsrockman2010), function(go){
GEOquery::Table(go)[, "R_MEAN_SIGNAL"]
})),
G = do.call(cbind, lapply(GEOquery::GSMList(geo_dsrockman2010), function(go){
GEOquery::Table(go)[, "G_MEAN_SIGNAL"]
}))
)
# normalize within channels
RG_dsrockman2010 <- limma::normalizeWithinArrays(RG_dsrockman2010, method = "loess")
RG_dsrockman2010 <- limma::RG.MA(RG_dsrockman2010) # convert back from MA to RG
# get only the RIALs (not the mixed stage controls)
X_dsrockman2010 <- RG_dsrockman2010$R
X_dsrockman2010[, P_dsrockman2010$label_ch1 == "Cy5"] <- RG_dsrockman2010$G[, P_dsrockman2010$label_ch1 == "Cy5"]
# format probe/gene ids
gpl <- GEOquery::getGEO(GEOquery::Meta(GEOquery::GSMList(geo_dsrockman2010)[[1]])$platform_id)
gpl <- GEOquery::Table(gpl)
gpl <- gpl[as.character(gpl$ID) %in% as.character(GEOquery::Table(GEOquery::GSMList(geo_dsrockman2010)[[1]])$ID_REF), ]
sel <- gpl$ORF%in%wormRef::Cel_genes$sequence_name
gpl <- gpl[sel,]
X_dsrockman2010 <- X_dsrockman2010[sel,]
# filter bad quality samples
cm_dsrockman2010 <- cor(log1p(X_dsrockman2010), method = 'spearman')
f_dsrockman2010 <- which(0.95 > apply(cm_dsrockman2010, 1, quantile, probs = .95))
X_dsrockman2010 <- X_dsrockman2010[,-f_dsrockman2010]
P_dsrockman2010 <- P_dsrockman2010[-f_dsrockman2010,]
rownames(X_dsrockman2010) <- gpl$ORF
X_dsrockman2010 <- RAPToR::format_ids(X_dsrockman2010, wormRef::Cel_genes,
from = "sequence_name", to = "wb_id")
dsrockman2010 <- list(g = X_dsrockman2010, p = P_dsrockman2010)
save(dsrockman2010, file = paste0(data_folder, "dsrockman2010.RData"), compress = "xz")
rm(geo_dsrockman2010, gpl, sel, RG_dsrockman2010, X_dsrockman2010, P_dsrockman2010,
cm_dsrockman2010, f_dsrockman2010)
library(readxl)
germline_url <- "https://static-content.springer.com/esm/art%3A10.1038%2Fnature25012/MediaObjects/41586_2017_BFnature25012_MOESM3_ESM.xlsx"
germline_file <- paste0(data_folder, "germline_gset.xlsx")
utils::download.file(url = germline_url, destfile = germline_file)
germline_set <- read_xlsx(germline_file, sheet = 3, na = "NA")[,c(1, 44:46)]
germline_set[is.na(germline_set)] <- FALSE
germline_set <- cbind(wb_id = germline_set[,1],
germline = apply(germline_set[, 2:4], 1, function(r) any(r)),
germline_set[, 2:4])
# germline_set$germline[is.na(germline_set$germline)] <- FALSE
germline <- germline_set[germline_set$germline,1]
germline_intrinsic <- germline_set[germline_set$germline_intrinsic,1]
germline_oogenesis <- germline_set[germline_set$germline_oogenesis_enriched,1]
germline_sperm <- germline_set[germline_set$germline_sperm_enriched,1]
soma_url <- "https://ars.els-cdn.com/content/image/1-s2.0-S1097276513009039-mmc2.xlsx"
soma_file <- paste0(data_folder, "soma_gset.xlsx")
utils::download.file(url = soma_url, destfile = soma_file)
soma_set <- read_xlsx(soma_file, skip = 3, na = "NA")[,c(1, 4)]
soma_set$class <- factor(soma_set$class)
soma_set$soma <- soma_set$class == "osc"
soma_set <- soma_set[soma_set$soma, 1]
gsubset <- list(germline = germline, soma = soma_set$`Gene WB ID`,
germline_intrinsic = germline_intrinsic,
germline_oogenesis = germline_oogenesis,
germline_sperm = germline_sperm)
save(gsubset, file = paste0(data_folder, "sc3_gsubset.RData"), compress = "xz")
file.remove(germline_file)
file.remove(soma_file)
rm(germline_url, germline_file, germline_set, soma_url, soma_file, soma_set)
# Copied from supp data of Francesconi & Lehner (2014)
francesconi_time <- data.frame(time =
c(4.862660944, 4.957081545, 5.051502146, 5.145922747, 5.240343348, 5.334763948, 5.429184549, 5.52360515,
5.618025751, 5.712446352, 5.806866953, 5.901287554, 5.995708155, 6.090128755, 6.184549356, 6.278969957,
6.373390558, 6.467811159, 6.56223176, 6.656652361, 6.751072961, 6.845493562, 6.939914163, 7.034334764,
7.128755365, 7.223175966, 7.317596567, 7.412017167, 7.506437768, 7.600858369, 7.69527897, 7.789699571,
7.884120172, 7.978540773, 8.072961373, 8.167381974, 8.261802575, 8.356223176, 8.450643777, 8.545064378,
8.639484979, 8.733905579, 8.82832618, 8.82832618, 8.82832618, 8.875536481, 8.875536481, 8.875536481,
8.875536481, 8.875536481, 8.875536481, 8.875536481, 8.875536481, 8.875536481, 8.969957082, 9.017167382,
9.017167382, 9.064377682, 9.064377682, 9.111587983, 9.206008584, 9.206008584, 9.206008584, 9.300429185,
9.394849785, 9.489270386, 9.489270386, 9.489270386, 9.489270386, 9.489270386, 9.583690987, 9.583690987,
9.583690987, 9.583690987, 9.583690987, 9.583690987, 9.583690987, 9.583690987, 9.630901288, 9.725321888,
9.819742489, 9.819742489, 9.819742489, 9.819742489, 9.819742489, 9.819742489, 9.819742489, 9.91416309,
10.00858369, 10.05579399, 10.05579399, 10.05579399, 10.05579399, 10.10300429, 10.19742489, 10.19742489,
10.29184549, 10.29184549, 10.29184549, 10.38626609, 10.38626609, 10.38626609, 10.43347639, 10.43347639,
10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639, 10.43347639,
10.43347639, 10.43347639, 10.527897, 10.6223176, 10.6223176, 10.6223176, 10.6223176, 10.6223176, 10.6223176,
10.6223176, 10.6695279, 10.6695279, 10.6695279, 10.7639485, 10.7639485, 10.7639485, 10.8583691, 10.8583691,
10.9527897, 10.9527897, 10.9527897, 11.0472103, 11.1416309, 11.2360515, 11.2360515, 11.3304721, 11.3304721,
11.3776824, 11.472103, 11.56652361, 11.66094421, 11.75536481, 11.84978541, 11.94420601, 12.03862661, 12.13304721,
12.22746781, 12.32188841, 12.41630901, 12.51072961, 12.60515021, 12.69957082, 12.79399142, 12.88841202, 12.98283262,
13.07725322, 13.17167382, 13.26609442, 13.36051502, 13.45493562, 13.54935622, 13.54935622, 13.54935622, 13.54935622,
13.54935622, 13.59656652, 13.69098712, 13.78540773, 13.78540773, 13.78540773, 13.87982833, 13.97424893, 14.06866953,
14.06866953, 14.06866953, 14.16309013, 14.25751073, 14.35193133, 14.44635193, 14.54077253, 14.63519313, 14.72961373,
14.82403433, 14.82403433, 14.82403433, 14.91845494, 14.96566524, 15.01287554, 15.10729614, 15.20171674, 15.29613734,
15.39055794, 15.48497854, 15.57939914, 15.67381974, 15.76824034, 15.86266094, 15.95708155, 16.05150215, 16.14592275,
16.24034335, 16.33476395, 16.42918455, 16.52360515),
geo_accession =
c("GSM588291", "GSM588174", "GSM588110", "GSM588097", "GSM588271", "GSM588203", "GSM588105", "GSM588200", "GSM588123",
"GSM588122", "GSM588115", "GSM588100", "GSM588171", "GSM588190", "GSM588229", "GSM588206", "GSM588277", "GSM588129",
"GSM588175", "GSM588151", "GSM588273", "GSM588216", "GSM588099", "GSM588117", "GSM588179", "GSM588164", "GSM588184",
"GSM588092", "GSM588285", "GSM588272", "GSM588228", "GSM588121", "GSM588170", "GSM588194", "GSM588143", "GSM588149",
"GSM588156", "GSM588220", "GSM588212", "GSM588089", "GSM588209", "GSM588253", "GSM588091", "GSM588113", "GSM588130",
"GSM588202", "GSM588191", "GSM588244", "GSM588227", "GSM588197", "GSM588233", "GSM588292", "GSM588163", "GSM588196",
"GSM588224", "GSM588283", "GSM588267", "GSM588257", "GSM588221", "GSM588274", "GSM588090", "GSM588114", "GSM588195",
"GSM588265", "GSM588182", "GSM588093", "GSM588157", "GSM588251", "GSM588177", "GSM588188", "GSM588269", "GSM588145",
"GSM588205", "GSM588162", "GSM588210", "GSM588166", "GSM588125", "GSM588252", "GSM588207", "GSM588173", "GSM588102",
"GSM588286", "GSM588107", "GSM588238", "GSM588189", "GSM588106", "GSM588295", "GSM588192", "GSM588134", "GSM588183",
"GSM588103", "GSM588198", "GSM588293", "GSM588218", "GSM588259", "GSM588234", "GSM588137", "GSM588152", "GSM588133",
"GSM588250", "GSM588168", "GSM588235", "GSM588148", "GSM588279", "GSM588140", "GSM588241", "GSM588111", "GSM588231",
"GSM588128", "GSM588131", "GSM588101", "GSM588088", "GSM588281", "GSM588159", "GSM588249", "GSM588290", "GSM588118",
"GSM588154", "GSM588136", "GSM588268", "GSM588204", "GSM588160", "GSM588135", "GSM588098", "GSM588294", "GSM588225",
"GSM588181", "GSM588248", "GSM588096", "GSM588217", "GSM588147", "GSM588176", "GSM588116", "GSM588146", "GSM588127",
"GSM588104", "GSM588108", "GSM588262", "GSM588223", "GSM588161", "GSM588237", "GSM588172", "GSM588284", "GSM588256",
"GSM588165", "GSM588211", "GSM588242", "GSM588169", "GSM588240", "GSM588264", "GSM588219", "GSM588287", "GSM588124",
"GSM588178", "GSM588167", "GSM588258", "GSM588232", "GSM588141", "GSM588112", "GSM588208", "GSM588215", "GSM588132",
"GSM588278", "GSM588275", "GSM588155", "GSM588153", "GSM588109", "GSM588185", "GSM588138", "GSM588094", "GSM588226",
"GSM588236", "GSM588266", "GSM588119", "GSM588222", "GSM588246", "GSM588150", "GSM588261", "GSM588201", "GSM588247",
"GSM588186", "GSM588280", "GSM588255", "GSM588288", "GSM588260", "GSM588139", "GSM588245", "GSM588270", "GSM588276",
"GSM588199", "GSM588254", "GSM588120", "GSM588144", "GSM588243", "GSM588214", "GSM588180", "GSM588126", "GSM588282",
"GSM588187", "GSM588158", "GSM588193", "GSM588213", "GSM588230", "GSM588263", "GSM588142", "GSM588095"),
stringsAsFactors = F
)
rownames(francesconi_time) <- francesconi_time$geo_accession
dsrockman2010$g <- limma::normalizeBetweenArrays(dsrockman2010$g, method = "quantile")
dsrockman2010$g <- log(dsrockman2010$g + 1)
We can use one of the young-adult references for C. elegans to stage the samples
r_ya <- prepare_refdata("Cel_YA_1", "wormRef", n.inter = 400)
ae_dsrockman2010 <- ae(dsrockman2010$g, r_ya$interpGE, r_ya$time.series)
We can check the estimates against the previously estimated ages on this data from Francesconi and Lehner (2014). These are in the supplementary data of the paper.
We can use ICA components to look at the dynamics of the dataset on a global scale. Some of the components can be associated to specific developmental processes. Looking at the gene loadings (or contributions) on the components allows us to establish enrichment in gene sets.
First, we can decide how many components to extract by looking at how many PCA components are needed to explain \(99\%\) of the variance.
pca_rock <- summary(stats::prcomp(dsrockman2010$g, rank = 30))
sum(pca_rock$importance["Cumulative Proportion",] < .99) + 1
#> [1] 28
ica_rock <- ica::icafast(dsrockman2010$g, nc = 28)
We can plot all components along the estimated age of the samples :
A lot of these components don’t have a definite link to developmental processes. We can have a closer look at the components that clearly capture development, and look at their gene loadings’ enrichment in gene sets of interest.
dev_comps <- c(1, 2, 5, 7, 8, 9)
From the plots above, we can establish that
Now, we stage the samples using only germline or soma gene subsets.
ae_soma <- ae(
dsrockman2010$g[rownames(dsrockman2010$g)%in%gsubset$soma,], # select soma gene subset
r_ya$interpGE, r_ya$time.series
)
ae_germline <- ae(
dsrockman2010$g[rownames(dsrockman2010$g)%in%gsubset$germline,], # select germline gene subset
r_ya$interpGE, r_ya$time.series
)
We can notice on the soma estimates that a sample appears quite off from its global or germline age (it’s marked with a \(\small{\triangle}\)). This sort of artefact can happen when using small genesets for estimates, as similar expression profiles can occur at different times (which is especially true for oscillatory profiles).
If we look at the correlation profile of this sample on the 3 estimates, we can see 2 peaks for the soma:
This scenario is a prime example of when it is appropriate to use a prior. We can input the global age as a prior to favor the first peak of the soma correlation profile.
ae_soma_prior <- ae(
dsrockman2010$g[rownames(dsrockman2010$g)%in%gsubset$soma,], # select soma gene subset
r_ya$interpGE, r_ya$time.series,
prior = ae_dsrockman2010$age.estimates[,1], # gaussian prior values (mean)
prior.params = 10 # gaussian prior sd
)
This now shifts our estimate to the first peak (note that the correlation profile itself is not changed).
At the same time, all of our other estimates essentially stay the same.
# 80 is the index of the GSM588171 sample
summary(ae_soma$age.estimates[-80, 1] - ae_soma_prior$age.estimates[-80, 1])
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -1.06496 0.00000 0.00000 -0.03407 0.00000 1.06496
Now, we can look once more at the components of our data using our tissue-specific estimates. We’ll use the prior version of the soma estimates for the plots below.
Notice how components IC5, IC7 and IC8, previously associated with soma, are much cleaner when plotting with the soma age. Also, see that the IC1, IC2 and IC9 germline-associated components appear quite noisy.
With the germline estimates, we get the opposite effect : much cleaner dynamics on the first two components and IC9, at the cost of clarity on the oscillatory dynamics.
Francesconi, Mirko, and Ben Lehner. 2014. “The Effects of Genetic Variation on Gene Expression Dynamics During Development.” Nature 505 (7482). Nature Publishing Group: 208.
Graveley, Brenton R, Angela N Brooks, Joseph W Carlson, Michael O Duff, Jane M Landolin, Li Yang, Carlo G Artieri, et al. 2011. “The Developmental Transcriptome of Drosophila Melanogaster.” Nature 471 (7339). Nature Publishing Group: 473.
Hendriks, Gert-Jan, Dimos Gaidatzis, Florian Aeschimann, and Helge Großhans. 2014. “Extensive Oscillatory Gene Expression During c. Elegans Larval Development.” Molecular Cell 53 (3). Elsevier: 380–92.
Lehrbach, Nicolas J, Cecilia Castro, Kenneth J Murfitt, Cei Abreu-Goodger, Julian L Griffin, and Eric A Miska. 2012. “Post-Developmental microRNA Expression Is Required for Normal Physiology, and Regulates Aging in Parallel to Insulin/Igf-1 Signaling in c. Elegans.” Rna 18 (12). Cold Spring Harbor Lab: 2220–35.
Levin, Michal, Leon Anavy, Alison G Cole, Eitan Winter, Natalia Mostov, Sally Khair, Naftalie Senderovich, et al. 2016. “The Mid-Developmental Transition and the Evolution of Animal Body Plans.” Nature 531 (7596). Nature Publishing Group: 637.
Li, Jingyi Jessica, Haiyan Huang, Peter J Bickel, and Steven E Brenner. 2014. “Comparison of d. Melanogaster and c. Elegans Developmental Stages, Tissues, and Cells by modENCODE Rna-Seq Data.” Genome Research 24 (7). Cold Spring Harbor Lab: 1086–1101.
Perez, Marcos Francisco, Mirko Francesconi, Cristina Hidalgo-Carcedo, and Ben Lehner. 2017. “Maternal Age Generates Phenotypic Variation in Caenorhabditis Elegans.” Nature 552 (7683). Nature Publishing Group: 106.
Rockman, Matthew V, Sonja S Skrovanek, and Leonid Kruglyak. 2010. “Selection at Linked Sites Shapes Heritable Phenotypic Variation in c. Elegans.” Science 330 (6002). American Association for the Advancement of Science: 372–76.